Introduction

This data set contains information about Netflix shows and movies, including genre, runtime, reviews from different sources, age certification, and release year, among other variables. The goal of this project is to predict the imdb_score of a title from this set of factors. Netflix is an online streaming platform that lets users watch any of its shows or movies on demand at any time. There are thousands of titles on the platform, and each one has an accompanying IMDB score from a separate platform, IMDB (the Internet Movie Database). That database holds ratings for every title, including reviews, votes cast by users, and a popularity score. This project uses machine learning to determine whether I can build a model that accurately predicts the IMDB score of a film or show available on Netflix.

Why Netflix Data?

My family loves movies, my sister especially, so I knew I wanted a project that predicts the IMDB score of a movie to see how much a movie's score really relies on performance rather than themes or popularity. Certain kinds of movies, like superhero movies or kids' movies, tend to do poorly in theaters, so I was interested to see what actually makes a good movie or show. Netflix has also been my favorite streaming platform for a long time, so I wanted to pull data from it. Sometimes I hear that a movie is "so good!" and then when I watch it I think it is not great and doesn't deserve the hype. I am going to see whether the themes of a show or its popularity actually make a difference when it comes to the IMDB score.

Stranger Things Poster
# Assigning the data to a variable
netflix <- read.csv(file = '/Users/mackenzie/Documents/pstat 131/netflix data/titles.csv')

Data Citation

I obtained this data set from Kaggle. It was created to list all shows and movies available on Netflix, was acquired in July 2022, and contains data available in the United States. The data was collected by Kaggle user Victor Soeiro.

https://www.kaggle.com/datasets/victorsoeiro/netflix-tv-shows-and-movies?select=titles.csv

Exploring and Tidying the Raw Data

Variable Selection

Let’s look at our data and see what we’re working with!

# Calling head() to see the first few rows
head(netflix)
##         id                               title  type
## 1 ts300399 Five Came Back: The Reference Films  SHOW
## 2  tm84618                         Taxi Driver MOVIE
## 3 tm154986                         Deliverance MOVIE
## 4 tm127384     Monty Python and the Holy Grail MOVIE
## 5 tm120801                     The Dirty Dozen MOVIE
## 6  ts22164        Monty Python's Flying Circus  SHOW
##                                                                                                                                                                                                                                                                                                                                                                                                                       description
## 1                                                                                                                                                                                                                                                                         This collection includes 12 World War II-era propaganda films — many of which are graphic and offensive — discussed in the docuseries "Five Came Back."
## 2                                                                                                                                                                                                                                                           A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived decadence and sleaze feed his urge for violent action.
## 3                                                                                                                                                                                                           Intent on seeing the Cahulawassee River before it's turned into one huge lake, outdoor fanatic Lewis Medlock takes his friends on a river-rafting trip they'll never forget into the dangerous American back-country.
## 4 King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Robin the Not-Quite-So-Brave-As-Sir-Lancelot and Sir Galahad the Pure. On the way, Arthur battles the Black Knight who, despite having had all his limbs chopped off, insists he can still fight. They reach Camelot, but Arthur decides not  to enter, as "it is a silly place".
## 5                                                                                                                  12 American military prisoners in World War II are ordered to infiltrate a well-guarded enemy château and kill the Nazi officers vacationing there. The soldiers, most of whom are facing death sentences for a variety of violent crimes, agree to the mission and the possible commuting of their sentences.
## 6                                                                                                                                                                                                                                                          A British sketch comedy series with the shows being composed of surreality, risqué or innuendo-laden humour, sight gags and observational sketches without punchlines.
##   release_year age_certification runtime
## 1         1945             TV-MA      51
## 2         1976                 R     114
## 3         1972                 R     109
## 4         1975                PG      91
## 5         1967                       150
## 6         1969             TV-14      30
##                                        genres production_countries seasons
## 1                           ['documentation']               ['US']       1
## 2                          ['drama', 'crime']               ['US']      NA
## 3 ['drama', 'action', 'thriller', 'european']               ['US']      NA
## 4             ['fantasy', 'action', 'comedy']               ['GB']      NA
## 5                           ['war', 'action']         ['GB', 'US']      NA
## 6                      ['comedy', 'european']               ['GB']       4
##     imdb_id imdb_score imdb_votes tmdb_popularity tmdb_score
## 1                   NA         NA           0.600         NA
## 2 tt0075314        8.2     808582          40.965      8.179
## 3 tt0068473        7.7     107673          10.010      7.300
## 4 tt0071853        8.2     534486          15.461      7.811
## 5 tt0061578        7.7      72662          20.398      7.600
## 6 tt0063929        8.8      73424          17.617      8.306

Most of these columns will not be important when fitting a model, so I will drop them now. The id and imdb_id are just identifiers, so I will drop those, along with title. I will also drop description, age_certification, and production_countries, since those values will not be helpful for this analysis. My goal is to predict imdb_score, so I will drop tmdb_score as well: the two scores are very similar, and I don't want my model to predict imdb_score directly from tmdb_score. Finally, I will drop seasons, since that value is missing for most observations (movies have no seasons).

The genres variable contains a list of several genres for each title, which is hard to work with. Ideally there would be only one value per observation, and since the genres are listed with the most descriptive genre first, we will keep just the first genre in each list.

# Splitting genres into just the first element in the list
netflix$genres <- gsub("\\[|\\]|'", "", netflix$genres)
netflix$genres <- sapply(strsplit(netflix$genres, ","), function(x) trimws(x[1])) 
# Removing variables that won't be helpful for analysis
netflix <- netflix %>%
  dplyr::select(-title, -id, -imdb_id, -description, -age_certification, -production_countries, -tmdb_score, -seasons) %>%
  # Turning categorical variables into factors
  mutate(type = factor(type), genres = factor(genres))

Checking the Size of the Data

I will check the size of the data to see whether I need to trim the data set, since a very large number of observations would make model tuning slow.

# Calling dim() to see dimensions of Netflix data
dim(netflix)
## [1] 5850    7

There are 5,850 observations and 7 variables (5,850 rows and 7 columns) in this data set. The number of observations seems very high for our analysis, so I am going to check whether any observations should be eliminated due to missing values. After that, I may still have to cut down the number of observations in order to continue my analysis.

Check for Missing Values

# See if there's any missing data
sum(is.na(netflix))
## [1] 1130

There are a lot of missing values in this dataset, so let’s see where those values lie and what I can do to eliminate these values.

# Calling summary() function to see where there are missing values
netflix %>%
  summary()
##     type       release_year     runtime                 genres    
##  MOVIE:3744   Min.   :1945   Min.   :  0.00   drama        :1421  
##  SHOW :2106   1st Qu.:2016   1st Qu.: 44.00   comedy       :1305  
##               Median :2018   Median : 83.00   documentation: 665  
##               Mean   :2016   Mean   : 76.89   thriller     : 377  
##               3rd Qu.:2020   3rd Qu.:104.00   action       : 365  
##               Max.   :2022   Max.   :240.00   (Other)      :1658  
##                                               NA's         :  59  
##    imdb_score      imdb_votes        tmdb_popularity    
##  Min.   :1.500   Min.   :      5.0   Min.   :   0.0094  
##  1st Qu.:5.800   1st Qu.:    516.8   1st Qu.:   2.7285  
##  Median :6.600   Median :   2233.5   Median :   6.8210  
##  Mean   :6.511   Mean   :  23439.4   Mean   :  22.6379  
##  3rd Qu.:7.300   3rd Qu.:   9494.0   3rd Qu.:  16.5900  
##  Max.   :9.600   Max.   :2294231.0   Max.   :2274.0440  
##  NA's   :482     NA's   :498         NA's   :91

Since there are 5,850 observations in this data set, we can afford to remove the observations that are missing information needed for the analysis. There are missing values in genres, imdb_score, imdb_votes, and tmdb_popularity, so I will remove every observation with at least one missing value.

# Removing observations with missing values
netflix <- netflix %>%
  na.omit() 

# Seeing the new dimensions of Netflix
dim(netflix)
## [1] 5277    7

Now there are no missing values in the data set. There are still 5,277 observations, so I will cut the data down to a size that keeps model fitting and tuning manageable; I am only going to take the first 300 observations. I also conducted my analysis with the full data set, and the correlation matrix showed little correlation between variables, so I decided to take a smaller number of observations in order to see clearer correlations. Note that keeping the first 300 rows is a convenience choice rather than a random sample.

# Trimming netflix to 300 observations
netflix <- netflix[1:300,]
dim(netflix)
## [1] 300   7

Exploratory Data Analysis

Stranger Things Cast Investigating

Visual EDA

Correlation Plot

netflix %>% 
  select(where(is.numeric)) %>% 
  cor() %>% 
  corrplot(type = "lower", diag = FALSE)

There is a positive correlation between imdb_votes and imdb_score, which suggests that imdb_votes is a good indicator of the score. There is also a positive correlation between tmdb_popularity and imdb_votes: the more popular a show or movie, the more votes it receives. There is a negative correlation between runtime and both popularity and IMDB score, meaning that as the runtime increases, the popularity and rating of the show/movie tend to decrease.

IMDB Votes vs. Popularity

Since there is a slight positive correlation between tmdb_popularity and imdb_votes, I will make a graph comparing those two variables.

netflix %>%
  ggplot(aes(x= tmdb_popularity, y = imdb_votes)) +
  geom_point() +
  geom_smooth(method= 'lm', se = FALSE) +
  labs(x = 'Popularity Score', y = 'Total Number of Votes', title = 'Comparison of Popularity and Votes') +
  theme_minimal()

Some of the movies are much more popular than others, so it is hard to see the pattern in the points; I have fitted a regression line in order to see the relationship more clearly. As the correlation plot predicted, we can see that as tmdb_popularity increases, imdb_votes increases. This is important because it indicates that popularity and votes may be important factors in determining IMDB score.

IMDB Score by Genre

I will now explore the genres in the data set and the average IMDB score for each, since there may be a relationship between genre and score.

# Make a data frame of average imdb_score by genre
avgscr_genre <- netflix %>%
  group_by(genres) %>%
  summarise(average = mean(imdb_score, na.rm = TRUE))

# Plot Average imdb_score by genre
avgscr_genre %>%
  ggplot(aes(x = genres, y = average, fill = genres)) +
  coord_flip() +
  geom_bar(stat = 'identity') +
  scale_fill_discrete() +
  labs(y = 'IMDB Score', x = 'Genres', title = 'Average IMDB Score By Genre')

Note: Colors Added for Aesthetic Purposes Only

As we can see from the graph, the average IMDB score does not change very much across genres. The largest gap is between the highest-rated genre on average, animation, and the lowest-rated, horror. This graph tells us there are some differences in IMDB score by genre, but it doesn't provide strong evidence that genre is a deciding factor when it comes to IMDB score.

Setting Up the Models

Splitting the Data

I will split the data into about 70% training and 30% testing while also stratifying on the outcome variable, imdb_score.

# Split the data and stratify on imdb_score
netflix_split <- initial_split(netflix, strata = imdb_score, prop = 0.7)
net_train <- training(netflix_split)
net_test <- testing(netflix_split)

We will verify that we split the data correctly.

# See the proportions of the training and testing sets
nrow(net_train)/nrow(netflix)
## [1] 0.6933333
nrow(net_test)/nrow(netflix)
## [1] 0.3066667

The training set has 69.3% of the data and the testing set has 30.7%, which is very close to the 70%/30% split we wanted, so the data was split correctly. We also stratified both sets on the outcome variable, imdb_score, by including strata = imdb_score.

Making a Recipe

In order to continue our analysis, we will need to set up a recipe that will be used for each model. We can modify the recipe later if needed, since some models have different requirements for the recipe. For example, some models, like the elastic net model, require the data to be normalized in order to fit. We set up our recipe using the net_train data, which is the training data that we created above. The outcome variable imdb_score is being predicted by the variables type, genres, release_year, runtime, imdb_votes and tmdb_popularity. I have also included step_center() and step_scale() in order to normalize the data, since most models require the data be normalized before analysis.

We have two categorical variables in this data set, genres and type. Since these variables are not numeric, we must convert them before modeling. To do so, I will dummy code them, which creates a separate binary indicator column for each level: for each unique genre, there will be a column that equals 1 for observations with that genre and 0 otherwise.

I have also included a prep() and bake() step in order to see the recipe I created.

# Creating recipe
netflix_recipe <- recipe(imdb_score ~ ., data = net_train) %>%
  # Dummy coding the categorical variables (type and genres)
  step_dummy(all_nominal_predictors()) %>%
  # Removing variables with zero variance
  step_zv(all_predictors()) %>%
  # Normalizing
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) 

# Let us prep and bake the recipe!
netflix_recipe %>%
  prep() %>%
  bake(new_data = net_train) %>%
  head()
## # A tibble: 6 × 21
##   release_year runtime imdb_votes tmdb_popularity imdb_score type_SHOW
##          <dbl>   <dbl>      <dbl>           <dbl>      <dbl>     <dbl>
## 1        -1.28   1.40      -0.499          -0.429        2.1    -0.620
## 2        -1.36   1.09      -0.497          -0.450        6.2    -0.620
## 3        -3.03   1.34      -0.499          -0.477        6.4    -0.620
## 4        -1.72   0.336     -0.499          -0.474        6.4    -0.620
## 5        -1.45   0.983     -0.499          -0.472        4.4    -0.620
## 6        -1.19   0.462     -0.469          -0.198        4.9    -0.620
## # ℹ 15 more variables: genres_animation <dbl>, genres_comedy <dbl>,
## #   genres_crime <dbl>, genres_documentation <dbl>, genres_drama <dbl>,
## #   genres_family <dbl>, genres_fantasy <dbl>, genres_history <dbl>,
## #   genres_horror <dbl>, genres_reality <dbl>, genres_romance <dbl>,
## #   genres_scifi <dbl>, genres_thriller <dbl>, genres_war <dbl>,
## #   genres_western <dbl>

K-Fold Cross Validation:

# Creating folds and stratifying on the outcome variable
netflix_folds <- vfold_cv(net_train, v = 10, strata = imdb_score)

I am utilizing k-fold cross-validation to get more reliable, less biased estimates of each model's performance on the training data. Instead of relying on a single train/test split, k-fold cross-validation averages performance across multiple resamples: each fold takes a turn as the assessment set while the model trains on the rest. I also stratify the folds on my outcome variable, imdb_score, so the distribution of scores stays balanced across folds.
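The fold mechanics can be sketched in base R (a toy illustration with a hypothetical 100-row data set, not the project's actual resampling code):

```r
# Toy sketch of the k-fold idea: assign each of 100 hypothetical rows to one
# of 10 folds, then check how many rows each training split contains
n <- 100
k <- 10
fold_id <- sample(rep(1:k, length.out = n))

# Each fold serves once as the assessment set; the other 9 folds train the model
train_sizes <- sapply(1:k, function(f) sum(fold_id != f))
train_sizes  # every training split holds 90 of the 100 rows
```

vfold_cv() handles this bookkeeping (plus stratification) for us.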

Model Building

Elle and Max

Now that we have tidied the data, explored it, and set up our recipe, we can build our models! The metric I have chosen to evaluate each model is the Root Mean Squared Error (RMSE). Since we are using regression to predict imdb_score, RMSE is a natural way to judge our models, and because it is computed the same way for every regression model we fit, it makes comparing them straightforward.
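For reference, RMSE is just the square root of the mean squared difference between true and predicted scores. A hand-rolled sketch with made-up numbers (not output from any model in this project):

```r
# Hypothetical true scores and predictions, for illustration only
truth <- c(7.2, 6.5, 8.1, 5.9)
pred  <- c(6.8, 6.9, 7.7, 6.2)

# RMSE: square root of the average squared error
rmse <- sqrt(mean((truth - pred)^2))
rmse  # about 0.38
```

This is the same quantity that yardstick reports for a fitted regression model.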

Fitting the Models

Each of the models follows a similar pattern, so I will organize this section by the steps needed to fit them.

# Linear Regression
lm_model <- linear_reg() %>%
  set_mode('regression') %>%
  set_engine('lm')

# K- Nearest Neighbors
# tuning the number of neighbors
knn_model <- nearest_neighbor(neighbors = tune()) %>%
  set_mode('regression') %>%
  set_engine('kknn')

# Ridge Regression
# tuning penalty and setting mixture = 0 to get ridge
rd_model <- linear_reg(mixture = 0, penalty = tune()) %>%
  set_mode('regression') %>%
  set_engine('glmnet')

# Elastic Net Regression
# tuning penalty and mixture
en_model <- linear_reg(penalty = tune(), mixture = tune()) %>%
  set_mode('regression') %>%
  set_engine('glmnet')

# Lasso Regression
# tuning penalty and setting mixture to 1 to get lasso
ls_model <- linear_reg(mixture = 1, penalty = tune()) %>%
  set_mode('regression') %>%
  set_engine('glmnet')

# Random Forest
# tuning mtry (number of predictors sampled at each split), trees (total 
# number of trees) and min_n (minimum number of observations in a node)
rf_model <- rand_forest(mtry = tune(), 
                        trees = tune(), 
                        min_n = tune()) %>%
  set_mode('regression') %>%
  set_engine('ranger', importance = 'impurity')
  
# Boosted Trees
# tuning trees, learn_rate, and min_n
bt_model <- boost_tree(trees = tune(),
                       learn_rate = tune(),
                       min_n = tune()) %>%
  set_mode('regression') %>%
  set_engine('xgboost') 

Setting up the Workflows

For each model we set up a workflow that includes both model and recipe.

# Linear Regression
lm_wflow <- workflow() %>% 
  add_model(lm_model) %>% 
  add_recipe(netflix_recipe)

# K-Nearest Neighbors
knn_wflow <- workflow() %>% 
  add_model(knn_model) %>% 
  add_recipe(netflix_recipe)

# Ridge Regression
ridge_wflow <- workflow() %>% 
  add_model(rd_model) %>%
  add_recipe(netflix_recipe)
  
# Elastic Net Regression
elastic_wflow <- workflow() %>%
  add_model(en_model) %>%
  add_recipe(netflix_recipe)

# Lasso Regression
lasso_wflow <- workflow() %>%
  add_model(ls_model) %>%
  add_recipe(netflix_recipe)

# Random Forest
randfor_wflow <- workflow() %>%
  add_model(rf_model) %>%
  add_recipe(netflix_recipe)

# Boosted Trees
boost_wflow <- workflow() %>%
  add_model(bt_model) %>%
  add_recipe(netflix_recipe)

Creating a Tuning Grid

For each model that is being tuned, we must specify a range for each hyperparameter. We do this by creating a tuning grid and manually entering ranges for each parameter.

# Linear Regression
# No parameters that need tuning - no grid

# K-Nearest Neighbors
knn_grid <- grid_regular(neighbors(range = c(1,10)), levels = 5)

# Ridge Regression/Lasso Regression
# note: penalty() is tuned on the log10 scale, so this range covers 1 to 10
rdls_grid <- grid_regular(penalty(range = c(0,1)), levels = 50)

# Elastic Net Regression
en_grid <- grid_regular(penalty(), mixture(range = c(0,1)), levels = 10)

# Random Forest
rf_grid <- grid_regular(mtry(range = c(1, 6)), trees(range = c(50,500)), min_n(range = c(5,20)), levels = 10)

# Boosted Trees
bs_grid <- grid_regular(trees(range = c(50, 200)), learn_rate(range = c(0.01,0.1), trans = identity_trans()), min_n(range = c(40, 60)), levels = 5)
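Conceptually, grid_regular() crosses evenly spaced values of each parameter, much like base R's expand.grid(). A minimal sketch with hypothetical level values:

```r
# Base-R analogue of a regular grid: every combination of the chosen levels
grid <- expand.grid(
  trees = c(50, 125, 200),
  min_n = c(40, 50, 60)
)
nrow(grid)  # 3 x 3 = 9 candidate combinations to evaluate
```

This is also why large levels values get expensive quickly: every extra level of every parameter multiplies the number of candidates that must be fit on every fold.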

Tuning the Models

Using the models, workflows, cross-validation folds, and tuning grids defined above, we will now tune the hyperparameters.

# Linear Regression
# Doesn't need to be tuned

# K-Nearest Neighbors 
knn_tune <- tune_grid(knn_wflow,
                      resamples = netflix_folds,
                      grid = knn_grid)
# Ridge Regression
rd_tune <- tune_grid(ridge_wflow,
                     resamples = netflix_folds,
                     grid = rdls_grid)
# Elastic Net Regression
en_tune <- tune_grid(elastic_wflow,
                     resamples = netflix_folds,
                     grid = en_grid)
# Lasso Regression
ls_tune <- tune_grid(lasso_wflow,
                     resamples = netflix_folds,
                     grid = rdls_grid)
# Random Forest
rf_tune <- tune_grid(randfor_wflow,
                     resamples = netflix_folds,
                     grid = rf_grid)
# Boosted Trees
bs_tune <- tune_grid(boost_wflow,
                     resamples = netflix_folds,
                     grid = bs_grid)

Saving the Models into RDS files

The models take a long time to run, so we save each tuning result to a separate RDS file; when we need them again we can load them back into the project with the read_rds() function.

# Linear Regression
# no tuning grid to save 

# K-Nearest Neighbors
write_rds(knn_tune, file = '/Users/mackenzie/Documents/pstat 131/netflix data/knn_tune.rds')

# Ridge Regression
write_rds(rd_tune, file = '/Users/mackenzie/Documents/pstat 131/netflix data/rd_tune.rds')

# Elastic Net Regression
write_rds(en_tune, file = '/Users/mackenzie/Documents/pstat 131/netflix data/en_tune.rds')

# Lasso Regression
write_rds(ls_tune, file = '/Users/mackenzie/Documents/pstat 131/netflix data/ls_tune.rds')

# Random Forest
write_rds(rf_tune, file = '/Users/mackenzie/Documents/pstat 131/netflix data/rf_tune.rds')

# Boosted Trees
write_rds(bs_tune, file = '/Users/mackenzie/Documents/pstat 131/netflix data/bs_tune.rds')

Loading the Saved Files

Loading in the saved RDS files with read_rds().

# Linear Regression
# no model saved to RDS file

# K-Nearest Neighbors
knn_tuned <- read_rds(file = '/Users/mackenzie/Documents/pstat 131/netflix data/knn_tune.rds')

# Ridge Regression
rd_tuned <- read_rds(file = '/Users/mackenzie/Documents/pstat 131/netflix data/rd_tune.rds')

# Elastic Net Regression
en_tuned <- read_rds(file = '/Users/mackenzie/Documents/pstat 131/netflix data/en_tune.rds')

# Lasso Regression 
ls_tuned <- read_rds(file = '/Users/mackenzie/Documents/pstat 131/netflix data/ls_tune.rds')

# Random Forest
rf_tuned <- read_rds(file = '/Users/mackenzie/Documents/pstat 131/netflix data/rf_tune.rds')

# Boosted Trees
bs_tuned <- read_rds(file = '/Users/mackenzie/Documents/pstat 131/netflix data/bs_tune.rds')

Collecting the Metrics

Collecting the best result from each model to determine which one produced the lowest RMSE.

# Linear Regression
lm_fit <- fit_resamples(lm_wflow, resamples = netflix_folds)
lm_results <- show_best(lm_fit, metric = 'rmse')


# K-Nearest Neighbors
best_neighbor <- select_by_one_std_err(knn_tuned, desc(neighbors), metric = "rmse")


# Ridge Regression
best_ridge <- select_by_one_std_err(rd_tuned, penalty, metric = "rmse")


# Elastic Net Regression
best_elastic <- select_by_one_std_err(en_tuned, penalty, mixture, metric = "rmse")


# Lasso Regression
best_lasso <- select_by_one_std_err(ls_tuned, penalty, metric = "rmse")


# Random Forest
best_randfor <- select_by_one_std_err(rf_tuned, mtry, trees, min_n, metric = "rmse")


# Boosted Trees
best_boost <- select_by_one_std_err(bs_tuned, trees, learn_rate, min_n, metric = "rmse")

Model Results

Stranger Things Cast

Let's see how our models performed!

# creating a data frame in order to visualize the models RMSE values
model_results <- data.frame(Model = c('Linear Regression', 'K-Nearest Neighbors', 
                                        'Ridge Regression', 'Elastic Net Regression', 
                                        'Lasso Regression', 'Random Forest', 'Boosted Trees'), 
                              RMSE = c(lm_results$mean, best_neighbor$mean, best_ridge$mean, 
                                       best_elastic$mean, best_lasso$mean, best_randfor$mean,
                                       best_boost$mean))
# sort from lowest to highest RMSE
model_results %>%
  arrange(RMSE)
##                    Model      RMSE
## 1          Random Forest 0.8307961
## 2          Boosted Trees 0.9308729
## 3    K-Nearest Neighbors 0.9432765
## 4 Elastic Net Regression 0.9593352
## 5      Linear Regression 0.9616597
## 6       Ridge Regression 0.9738367
## 7       Lasso Regression 1.0753555

Here is a graph showing the results from the table above. The bar colors are for aesthetic purposes only.

model_results %>%
  ggplot(aes(x = Model, y = RMSE, fill = Model)) +
  geom_bar(stat = 'identity') +
  scale_fill_manual(values = c('red2', 'orange2', 'yellow3', 'green3', 'blue3', 'purple2', 'pink2')) +
  labs(x = 'Model', y = 'Root Mean Squared Error', 
       title = 'Comparison of Model and RMSE')

To select the model to carry forward to testing, we choose the one with the lowest RMSE. The model that performed best on the training data was the random forest, with an RMSE of 0.83. The next best model was K-Nearest Neighbors, which had the next lowest RMSE. The worst performing model was lasso regression, with a high RMSE of 1.075.

# Plot of Random Forest Model
autoplot(rf_tuned, metric = 'rmse')

As we can see from this plot, smaller minimal node sizes had a lower RMSE than larger ones. The RMSE values also stayed fairly consistent across the number of trees. We can also see that RMSE reaches a low once the number of predictors equals 4 and then starts to plateau. We picked the best model using the select_by_one_std_err() function, which selects the simplest model whose performance is within one standard error of the best. Simpler models are easier to use and tend to generalize better, so we want the simplest model that performs comparably.
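The one-standard-error rule can be illustrated with a toy results table (hypothetical numbers, not the actual tuning output):

```r
# Toy tuning results to illustrate the rule behind select_by_one_std_err()
results <- data.frame(
  trees     = c(50, 200, 500),
  mean_rmse = c(0.84, 0.83, 0.82),
  std_err   = c(0.04, 0.04, 0.04)
)

# The best model sets the threshold: lowest RMSE plus one standard error
best <- results[which.min(results$mean_rmse), ]
threshold <- best$mean_rmse + best$std_err

# Among candidates within the threshold, keep the simplest (fewest trees)
candidates <- results[results$mean_rmse <= threshold, ]
candidates[which.min(candidates$trees), "trees"]  # 50
```

Here all three candidates fall within one standard error of the best, so the rule picks the 50-tree model even though the 500-tree model has the lowest mean RMSE.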

Results of the Best Model

Here are the statistics for the best model we have selected:

# RMSE for the best model - random forest
best_randfor
## # A tibble: 1 × 11
##    mtry trees min_n .metric .estimator  mean     n std_err .config  .best .bound
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>    <dbl>  <dbl>
## 1     4    50     5 rmse    standard   0.831    10  0.0448 Preproc… 0.801  0.840

The Random Forest model with 4 predictors per split, 50 trees, and a minimum node size of 5 performed the best, with a low RMSE of 0.83!

Last Fit and Final Test

We have tuned all the models and determined that the Random Forest model is the best for this data. Now we need to fit it to the training data once more, using the best hyperparameters, before we can use it on the testing data. After we have a final trained model, we will assess its performance on the testing data, which it has never seen before.

# Selecting best model based on RMSE
best_rf <- select_best(rf_tuned, metric = 'rmse')
# Finalizing workflow with best model
final_wflow <- finalize_workflow(randfor_wflow, best_rf)
# Fitting the model to training set one last time
final_fit <- fit(final_wflow, data = net_train)

# Saving the new fit
write_rds(final_fit, file = '/Users/mackenzie/Documents/pstat 131/netflix data/final_fit.rds' )

# Loading Final Model
final_fit <- read_rds('/Users/mackenzie/Documents/pstat 131/netflix data/final_fit.rds')

Now we will use the final model to predict the testing data and see what the testing RMSE is on the new data.

# Creating metric set for RMSE
metric <- metric_set(rmse, rsq)

# Finding the RMSE on the Testing Data
augment(final_fit, new_data = net_test) %>%
  metric(truth = imdb_score, estimate = .pred)
## # A tibble: 2 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       0.859
## 2 rsq     standard       0.443

The best model had a slightly higher RMSE on the testing data (0.859) than on the training data, which is expected, and that value is still reasonably low. We also got a testing \(R^2\) value of 0.443, which means that our model explains 44.3% of the variance in the testing data. This R-squared value is only alright; ideally it would be closer to 1, or at least above 0.5, for the model to be considered strong. This doesn't mean our model is bad, but it does have limitations when it comes to explaining the variance in the data.
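As a reminder of what that \(R^2\) of 0.443 means, the traditional \(R^2\) is one minus the ratio of residual variance to total variance. A hand-computed sketch with made-up numbers (note that yardstick's rsq() is the squared correlation, while rsq_trad() matches this traditional form):

```r
# Hypothetical truth/prediction vectors, for illustration only
truth <- c(7.2, 6.5, 8.1, 5.9, 6.8)
pred  <- c(6.8, 6.9, 7.7, 6.2, 6.6)

# Traditional R-squared: share of variance in the truth explained by the model
ss_res <- sum((truth - pred)^2)          # residual sum of squares
ss_tot <- sum((truth - mean(truth))^2)   # total sum of squares
rsq <- 1 - ss_res / ss_tot
rsq  # about 0.77
```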

Which Variables Actually Affect IMDB Score?

final_fit %>%
  extract_fit_engine() %>%
  vip(aesthetics = list(fill = 'blue4')) + 
  labs(y = 'Importance', x = 'Variable', 
       title = 'Importance of Variables to IMDB Score')

This shows that the IMDB score is influenced most by how many votes a title receives. As we saw in the correlation plot earlier, the number of votes a movie receives is positively correlated with its popularity score, which suggests that a movie only scores well if it is popular. Runtime was also an important factor: runtime and IMDB score were negatively correlated, so longer titles tended to score lower. This makes sense, since viewers are less interested in watching very long movies/shows.

Conclusion

The best model was the Random Forest, since it produced the lowest RMSE of all the models fit. The next best was K-Nearest Neighbors (KNN), which also produced a relatively low RMSE. It makes sense that these two models predicted the data well. The random forest is a very flexible model that makes few assumptions about the relationship between predictors and outcome. KNN also did well because this data set has few predictor variables; KNN tends to overfit when there are many predictors, so we avoided that by having fewer. The three worst models were linear regression, ridge regression, and lasso regression, all of which assume a linear relationship between the predictor variables and the outcome variable. Since they all did poorly compared to the other models, we can infer that my data does not follow a linear relationship, which adds to the reasons the Random Forest model predicted the outcome variable so well.

After going through the EDA section of the project, I discovered that genre had little effect on the IMDB score of a movie. This makes sense, and confirms that any movie can be good or bad depending on other factors. This project also supported the belief that popular movies do better: popularity and the number of votes are two of the three most important factors in predicting IMDB score, so the IMDB score depends heavily on popularity. We could also ask what makes a show popular. Is popularity driven by the director, producer, and/or actors? Answering that would require more research into the statistics of each director, producer, and actor: the number of movies they've been in, screen time, social media following, or the average IMDB score of movies they have worked on. For now, we can conclude that if a show is very popular its IMDB score tends to increase, meaning popularity is the most important factor for movies/shows. This also suggests that IMDB is biased towards popular titles: ranking the best movies by IMDB score will largely give you a list of the most popular movies, not necessarily the best ones.

Thank you for your time :)